Toward genome assemblies for all marine vertebrates: current landscape and challenges

Abstract Marine vertebrate biodiversity is fundamental to ocean ecosystem health but is threatened by climate change, overharvesting, and habitat degradation. High-quality reference genomes are valuable foundational scientific resources that can inform conservation efforts. Consequently, global consortia are striving to produce reference genomes for representatives of all life. Here, we summarize the current landscape of available marine vertebrate reference genomes, including their phylogenetic diversity and geographic hotspots of production. We discuss key logistical and technical challenges that remain to be overcome if we are to realize the vision of a comprehensive reference genome library of all marine vertebrates.


Reviewer #1
Comment 1: As a brief commentary, this manuscript talks mainly about the current landscape and challenges for marine vertebrates' genome assemblies. In general, the overall writing is good, and numerous data were collected to support the main conclusions. Only minor revisions are required before acceptance for publication. Extra editing is necessary. For example, under the section of "Challenges and opportunities", provide the full name for EBP.
Response 1: We have checked the text to ensure all abbreviations are defined. The full name for EBP was provided on the title page, as well as at first use in the text: "…, includes data on Earth BioGenome Project (EBP) species [1]…"
Comment 2: Under the section of "Available genomes are predominantly derived from short-read sequencing technology": The authors are recommended to mention the current status of increasing availability of chromosome-level genome assemblies based on the Hi-C sequencing technology, although Hi-C appears in Fig. 1.
Response 2: We agree with the reviewer that this is of interest but felt uncomfortable with the criteria for calling assemblies "chromosome-level" within the space constraints of this commentary. It is something we intend to expand upon in a future review. We have increased the emphasis by expanding the last sentence: "The current dominance of short-read technology likely represents a transitory lag phase, and we expect an imminent shift to long-read-based assemblies, scaffolded to chromosome-level with Hi-C proximity ligation sequencing, as costs decline and accessibility improves [7]."
Comment 3: Therefore, under the section of "Challenges and opportunities", the authors may provide short discussions about the difficulty in Hi-C sequencing and comparative genomics analysis (due to available data from different sequencing methods; see Fig. 1) of some marine vertebrates.
Response 3: The requirements for successful Hi-C scaffolding include questions of both sample and assembly quality. Whilst this is clearly of interest, it was beyond the scope and space constraints of this commentary. We have instead concentrated on highlighting more general concerns about available sample quality for sequencing and assembly.

Reviewer #2
The authors underscore the importance of marine vertebrate genome assemblies given the current climate crisis and considering the role the oceans and marine biodiversity play in its understanding and stabilization and how they would support the blue economy. They highlight the small number of marine vertebrate species available, how the distribution of the species is skewed towards resource-rich regions, and point out the challenges faced while trying to increase the efforts. I think the manuscript in general follows the guidelines for a Commentary article in this Journal. The manuscript is well written, the methods described are appropriate for the type of study described and the conclusion supports the data shown. However, there are 3 points that I'd like to be addressed.
Comment 1: The manuscript, though following the guidelines of a Commentary article, feels like a presentation of the Ocean Genomes project. Is this the case?
1.1 If so, it would be interesting to have more information about the project on a different type of manuscript.
1.2 If not, it would be important to have a link to the project site (or a link to a site where the details of the project are described). Otherwise, and understanding that there might be a limit with the citations, there are other projects working on marine species that could be cited here as well.
Response 1: Reviewer #2 is correct to suppose that a more detailed description of the Ocean Genomes project will follow. We thought it useful to highlight the project here, and have provided a link as requested, in addition to emphasising that it is new. We have also now shifted some of the details into the Authors' Information section.
"For example, the new EBP-affiliated project Ocean Genomes (https://www.minderoo.org/oceanomics) has a specific focus on southern hemisphere marine vertebrates, particularly Indian and Indo-West Pacific fauna."
We did not add links to more projects due to the citation limit of 10. We have instead tried to emphasise that the EBP umbrella contains many affiliated projects of interest: "Towards this goal, Genomes on a Tree [10] aims to synthesise metadata across all genome projects, which importantly includes assembly progress and data for Earth BioGenome Project (EBP) species and affiliated large-scale projects [1]."
Comment 2: Given the progress of the different genome projects, and the pace at which the NCBI database is being updated, would it be possible to add when the data has been retrieved?
Response 2: The date has been added.
Comment 3: The resolution of the figures needs to be improved. The font size or format makes it hard to read.
Response 3: The figures were supplied at 600 dpi resolution, so this may be a PDF issue. We will supply alternative high-resolution formats if required.
Editor comments: Please add info about missing the BGI T1K fishes project to the revised manuscript.
Response: Whilst we did originally have a section on annotation and transcriptomics, we cut it out due to space constraints. We therefore wondered whether this was supposed to refer to the Fish10K project? Due to the citation limit, we have tried to cover this and other affiliated projects under the blanket coverage of the Earth BioGenome Project (EBP). Please let us know if you would prefer us to increase the citation count.

Full details of the methods are described throughout the main text of the commentary, not in a dedicated methods section.

Background
Reference genomes have become a fundamental tool for modern biology: reference genome-enabled applications have driven discoveries across medicine and healthcare, agriculture, biodiversity, ecology, conservation, and evolution. Reference genomes have been economically, logistically, technically, and computationally challenging to produce, leading to reliance on select model organisms to inform genomics-based research. In recent years, advances in sequencing technology and computational tools have facilitated the rapid and affordable production of reference genomes for non-model organisms across the tree of life, with ambitious global efforts underway to compile reference genomes for all eukaryotes [1]. The enhanced capacity for large-scale production of reference genomes is timely, as inferences from reference genome-enabled research can inform conservation management practice in this period of unprecedented biodiversity loss and ecosystem decline [2]. Here, we discuss the current landscape of available reference genomes with a focus on marine vertebrates in light of global recognition of the critical role that the oceans and marine biodiversity play in stabilising our climate and supporting a blue economy [3].

Main text
Reference genomes are unavailable for over 96% of marine vertebrate species
At the time of writing, we assessed the number and phylogenetic diversity of reference genomes currently available for marine vertebrate species. Metadata for assemblies categorised as reference-level were obtained from the National Center for Biotechnology Information (NCBI) via their Datasets command line tool using "chordates" as the query (Accessed: 01/08/2023). Resulting entries were cross-referenced with all known marine vertebrate species (n=19,800) from the World Register of Marine Species [4], yielding a final dataset of 697 assemblies representing 688 unique species (Supplementary Table 1). Eighty-four percent of marine vertebrate orders are represented by at least one species (78/93, Table 1), highlighting the progress of existing global consortia with early strategies to target order-level representatives [5]. Representation rapidly diminishes at lower taxonomic levels, however, covering only 41% of marine vertebrate families, 12% of genera, and 3.5% of species. Perciformes, the most speciose vertebrate order, has the highest number of reference genomes, yet still only 39% of Perciformes families are represented. Furthermore, orders with the highest percentage of threatened species according to the IUCN Red List are amongst the least represented. For example, 47 Rhinopristiformes species are listed as threatened [6], yet currently only 2 species from this order are represented by a reference genome. These data emphasise the need for continued efforts to capture the rich diversity of marine vertebrates, particularly the most vulnerable taxa that are likely to be of high conservation value.
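
The cross-referencing step and the coverage percentages reported above can be illustrated with a minimal sketch. It is not the authors' pipeline: the record layout (`Species`), the function names, and the placeholder taxon names are assumptions introduced here purely for illustration, standing in for a WoRMS-derived species list and NCBI-derived assembly metadata.

```python
from collections import namedtuple

# Hypothetical record layout for one marine species (illustrative only).
Species = namedtuple("Species", ["name", "genus", "family", "order"])


def coverage_by_rank(marine_species, assembly_species_names):
    """Percent of marine taxa with at least one reference genome, per rank."""
    assembled = set(assembly_species_names)
    # Cross-reference: keep marine species that have an assembly.
    covered = [s for s in marine_species if s.name in assembled]
    result = {}
    for rank in ("order", "family", "genus", "name"):
        all_taxa = {getattr(s, rank) for s in marine_species}
        covered_taxa = {getattr(s, rank) for s in covered}
        label = "species" if rank == "name" else rank
        result[label] = 100.0 * len(covered_taxa) / len(all_taxa)
    return result


def percent(part, whole):
    """Percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Placeholder taxa, not real species.
marine = [
    Species("Aus bus", "Aus", "Aidae", "Aiformes"),
    Species("Aus cus", "Aus", "Aidae", "Aiformes"),
    Species("Dus eus", "Dus", "Didae", "Aiformes"),
    Species("Fus gus", "Fus", "Fidae", "Fiformes"),
]
# "Xus yus" has an assembly but is not in the marine list, so it is ignored.
toy_coverage = coverage_by_rank(marine, ["Aus bus", "Xus yus"])
```

Applied to the counts given in the text, `percent(688, 19800)` evaluates to 3.5 and `percent(78, 93)` to 83.9, consistent with the reported species-level and order-level representation.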

Available genomes are predominantly derived from short-read sequencing technology
We next sought to characterise available reference genomes in terms of their quality and the sequencing technologies used for their generation. Data on technology type are submitter-defined and were unavailable for 219 assemblies (31%). Among the remaining assemblies, Illumina short-read sequencing was most common (n=337), followed by Pacific Biosciences long-read sequencing (n=171, Figure 1A). This trend remained even when restricting analysis to reference genomes released in 2023 alone (Figure 1B). Regarding contiguity, the production of high-contiguity genomes (contig N50 >1 Mbp) is accelerating (Figure 2A), along with a general trend of increasing contiguity over time (Figure 2B). The current dominance of short-read technology likely represents a transitory lag phase, and we expect an imminent shift to long-read-based assemblies, scaffolded to chromosome-level with Hi-C proximity ligation sequencing, as costs decline and accessibility improves [7].
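
For readers less familiar with the contiguity metric used above, the following minimal sketch shows the standard definition of contig N50 and the 1 Mbp threshold applied in the text. The analysis itself most likely relied on N50 values already reported in the assembly metadata, so the function names and the example contig lengths here are illustrative assumptions only.

```python
HIGH_CONTIGUITY_BP = 1_000_000  # 1 Mbp threshold used in the text


def contig_n50(contig_lengths):
    """Return the contig N50 in base pairs.

    N50 is the length L such that contigs of length >= L together
    contain at least half of all assembled bases.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    total = sum(contig_lengths)
    running = 0
    # Walk from the longest contig down until half the assembly is covered.
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


def is_high_contiguity(contig_lengths, threshold=HIGH_CONTIGUITY_BP):
    """True if the contig N50 strictly exceeds the threshold."""
    return contig_n50(contig_lengths) > threshold
```

For example, for contigs of length 2, 3, 4, 5 and 6 (total 20), the two longest contigs sum to 11, which is at least half the total, so the N50 is 5.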

Data production is biased toward higher-resourced regions
To examine the geographic distribution of marine vertebrate reference genome resources, we cross-referenced the assembled species with comprehensive sighting data extracted from the Ocean Biodiversity Information System (OBIS) full report [8]. Projecting sightings data onto a world map revealed a clear spatial imbalance favouring fauna occurring in oceans and coastal regions of North America, the United Kingdom, and the east coast of Australia (Figure 3). Reference genome (and perhaps sightings) data representing the fauna of lower-resourced regions is comparatively lacking. This not only reveals a large data gap but emphasises the need for equitable representation across diverse marine regions to ensure a holistic understanding of our global ocean biodiversity.
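
One simple way to summarise sightings spatially before mapping is to count them per latitude/longitude grid cell. The sketch below is an assumption about how such a summary could be built, not a description of how Figure 3 was produced: the tuple layout for occurrence records, the 5-degree cell size, and the function names are all hypothetical.

```python
import math
from collections import Counter


def grid_cell(lat, lon, cell_deg=5.0):
    """Return the south-west corner of the grid cell containing a point."""
    return (math.floor(lat / cell_deg) * cell_deg,
            math.floor(lon / cell_deg) * cell_deg)


def sightings_per_cell(occurrences, species_with_genome, cell_deg=5.0):
    """Count sightings per grid cell for species with a reference genome.

    occurrences: iterable of (species_name, latitude, longitude) tuples.
    """
    wanted = set(species_with_genome)
    counts = Counter()
    for name, lat, lon in occurrences:
        # Cross-reference: skip sightings of species without an assembly.
        if name in wanted:
            counts[grid_cell(lat, lon, cell_deg)] += 1
    return counts


# Placeholder records, not real OBIS data.
toy_occurrences = [
    ("Aus bus", 12.3, 101.0),
    ("Aus bus", 14.9, 104.9),
    ("Aus bus", -33.9, 151.2),
    ("Xus yus", 12.0, 101.0),  # no reference genome, so not counted
]
toy_counts = sightings_per_cell(toy_occurrences, {"Aus bus"})
```

Cells with high counts would then correspond to the hotspots visible on a map, while empty or sparse cells would indicate regions whose fauna is under-represented among assembled species.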

Challenges and opportunities
Remarkable advances in sequencing and computational power are enabling more efficient production of high-quality marine vertebrate reference genomes, but some important challenges to scaling representation remain. The requirement for high-molecular-weight DNA input for long-read sequencing renders many archival samples unsuitable for high-quality reference genome production. Dedicated fresh sampling of marine vertebrates is logistically complex even for common and relatively accessible species. Many threatened and rare species may not be amenable to fresh sampling at all, with opportunities for reference genome assembly limited to species which can be live-sampled or obtained from poorer quality DNA sources, such as archival collections [9]. This risks biasing genome production toward common species that are accessible to well-resourced data producers, exacerbating the gaps in our understanding of marine biodiversity and constraining the conservation management utility of reference genome-enabled research. Harmonisation of global initiatives is required to reduce duplication of effort, maximise resource efficiency, and ensure equitable representation of fauna across the phylogeny and diverse marine regions. Towards this goal, Genomes on a Tree [10] aims to synthesise metadata across all genome projects, which importantly includes assembly progress and data for Earth BioGenome Project (EBP) species and affiliated large-scale projects [1]. Enabling dedicated hubs for local data production for underrepresented groups or geographic regions will also facilitate better representation of global marine biodiversity, including rare or threatened taxa with restricted distributions. Adhering to best practices of generating such resources in the place of sample provenance is important to promote fair and equitable sharing of benefits arising from the use of genetic resources [2,5]. For example, the new EBP-affiliated project Ocean Genomes (https://www.minderoo.org/oceanomics) has a specific focus on southern hemisphere marine vertebrates, particularly Indian and Indo-West Pacific fauna.

Conclusions
The convergence of extended read lengths and high-accuracy base calling represents a paradigm shift in sequencing technology that has enabled a dramatic improvement in both the rate and quality of reference genome creation. Nevertheless, a considerable data gap remains, with over 96% of marine vertebrate species currently lacking a reference genome. By harnessing advancements in technology and bioinformatics, resource building in underrepresented regions, and continued global coordination and standardisation of efforts, the UN Decade of Ocean Science for Sustainable Development can also be the decade of marine vertebrate genomes.

Table 1. Summary of the NCBI-listed reference genomes available for marine vertebrates by Order
Class; Order; Marine species; Species with reference genome; Percentage with reference genome; IUCN Red List % Threatened [6]
Myxini
Figure 1. Available reference genomes by sequencing technology and year of release. A) An upset plot showing the frequency of all reference genomes for marine vertebrates to date according to submitter-reported sequencing technologies used for their generation. Colours indicate one (red), two (cyan), three (green) or four (navy) technologies used, respectively. Data is missing for 219 assemblies (31%); these assemblies are not shown. B) The frequencies of reference genome assembly releases according to both year of release and combinations of technology types; scaffolding refers to Hi-C and/or Bionano, short-read refers to Illumina and/or BGI-Seq, and long-read refers to PacBio and/or Nanopore technology.

Figure 2. The contiguity of available reference genomes by year of release. The frequency of assemblies for marine vertebrates with a contig N50 > 1 Mbp by year of release (A), and box plots of contig N50 values by year of release (B).

Figure 3. Global sightings of marine vertebrates with reference genomes. The list of species for which a reference genome is available on NCBI was cross-referenced with all sightings for these species collated by the Ocean Biodiversity Information System (OBIS). The total number of sightings is represented here. For ease of visualisation, the OBIS data was restricted to sightings of ray-finned (Actinopteri) and cartilaginous fish (Elasmobranchii and Holocephali) since the year 2000.
